Abstract

Many researchers fail to understand that 
reliability is a function of scores, not tests. This paper 
provides an explanation of the distinction as well as a 
description of the reliability generalization meta-analysis 
technique. Reliability generalization meta-analysis can 
provide a way to aggregate test score reliability 
coefficients from prior studies, based on the 
characteristics of those studies. The resulting information 
can help researchers anticipate score reliability and 
identify characteristics for improving score reliability. 
It is unfortunately all too common to find authors of 
education and psychology journal articles describing the 
"reliability of the test" or stating that "the test is 
reliable." Such statements fail to recognize that 
reliability is a characteristic of scores, and not of 
tests. As Pedhazur & Schmelkin (1991) noted, 
"Statements about the reliability of a measure are 
[inherently] inappropriate and potentially misleading" (p. 
82).

Similarly, Gronlund & Linn (1990) emphasized that the 
"results" are reliable, rather than "an evaluation 
instrument." They wrote:

Reliability refers to the results obtained with 
an evaluation instrument not to the instrument 
itself... Thus, it is more appropriate to speak of 
the reliability of the "test scores" or the 
"measurement" than of the "test" or the 
"instrument" (p. 78, emphasis in original).

Rowley (1976) noted, "It needs to be established that 
an instrument itself is neither reliable nor unreliable... A 
single instrument can produce scores which are reliable, 
and other scores which are unreliable" (p. 53, emphasis 
added). To summarize, it must be clearly understood that a 
test is not 'reliable' or 'unreliable.' Rather, 
"reliability is a property of the scores on a test for a 
group of examinees" (Crocker & Algina, 1986, p. 144, 
emphasis added).

Because tests are not reliable per se, score reliability 
fluctuates from study to study and must be investigated 
in each study. The purpose of the present paper is to 
explain an innovative method for evaluating the sources 
of measurement error variance in scores as these occur 
across studies: the reliability generalization method 
(Vacha-Haase, 1998).

Reliability generalization is an extension of validity 
generalization, the well-known method described by 
Schmidt & Hunter (1977) and Hunter & Schmidt (1990). In 
validity generalization inquiries (Schmidt & Hunter, 1977), 
studies are used as the unit of analysis, and means, 
standard deviations and other descriptive statistics are 
computed for the validity coefficients across studies. The 
validity coefficients across studies may also be used as 
the dependent variables in regression or other analyses. In 
these analyses, the features of the studies (e.g., sample 
size, types of samples, ages of participants) that best 
predict the variations in the obtained validity 
coefficients are investigated. 
The same approach can be used to investigate reliability 
coefficients for a given measure across studies, as 
proposed by Vacha-Haase (1998). The method can be used to 
characterize for a given test (a) the typical reliability 
of scores across studies, (b) the amount of variability in 
reliability coefficients, and (c) the sources of 
variability in reliability coefficients across studies. The 
present paper provides an accessible summary of 
Vacha-Haase's important reliability generalization method.

The reliability generalization process initially 
requires the researcher to identify all prior studies that 
report reliability coefficients for the test under 
investigation. Studies must use the same methods for 
measuring reliability. 

Huck & Cormier (1996) list three classical methods for 
estimating internal consistency reliability: the split-half 
reliability coefficient, Kuder-Richardson #20 (also known 
as KR-20), and Cronbach's alpha. Of course, even more 
sophisticated "modern" (i.e., non-classical) reliability 
coefficients can also be computed, such as generalizability 
coefficients (cf. Eason, 1991; Thompson, 1991). Huck & 
Cormier (1996) emphasize that reliability estimates do not 
necessarily generalize across methods, so it is important 
to identify the types of reliability statistics that will 
be used in the reliability generalization study.
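
As a minimal sketch of why such a coefficient describes scores rather than a test, the following Python function computes Cronbach's alpha from an examinee-by-item score matrix. The function name `cronbach_alpha` and both small data sets are hypothetical illustrations, not data from any study cited here. The same four items are answered by a sample with widely spread scores and by a sample with a restricted range of scores.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Return Cronbach's alpha for rows of examinees by columns of items."""
    k = len(scores[0])                     # number of items
    item_columns = list(zip(*scores))      # transpose: one tuple per item
    sum_item_var = sum(variance(col) for col in item_columns)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum_item_var / total_var)

# Hypothetical responses (1-5 scale) to the same four items.
wide_range_sample = [
    [1, 1, 2, 1], [2, 2, 2, 3], [3, 3, 4, 3],
    [4, 4, 4, 5], [5, 5, 4, 5], [2, 3, 2, 2],
]
narrow_range_sample = [
    [3, 3, 3, 3], [4, 4, 4, 3], [3, 3, 4, 3],
    [4, 4, 3, 4], [3, 4, 3, 3], [4, 3, 4, 4],
]

alpha_wide = cronbach_alpha(wide_range_sample)
alpha_narrow = cronbach_alpha(narrow_range_sample)
print(f"alpha, wide-range sample:   {alpha_wide:.2f}")
print(f"alpha, narrow-range sample: {alpha_narrow:.2f}")
```

Under these assumed data the identical items produce a markedly higher alpha in the wide-range sample than in the narrow-range sample, which is the sense in which reliability belongs to a set of scores.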

The next step in the reliability generalization 
process is to identify common information that is provided 
in each study (e.g., sample size, gender, age), as well as 
natural subscore divisions that may be reported 
appropriately for the test. These data are then coded, and 
each piece of information becomes an independent variable 
for the statistical analysis of the reliability 
coefficients for scores on the test.

Vacha-Haase (1998) demonstrates statistical treatments 
of reliability coefficients that can be applied as a part 
of the reliability generalization analysis. Descriptive 
statistics can be computed to describe central tendency and 
variability of reliability coefficients across studies. 
These statistics can give researchers a benchmark against 
which to compare reliability coefficients for scores in 
their own studies. Further statistical analyses can be 
conducted to discover which variables (e.g., sample size, 
type of reliability coefficient, characteristics of study 
participants) contribute most to (or detract most from) 
test score reliability.
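
The descriptive step can be sketched in a few lines of Python. The coefficients below are hypothetical values invented for illustration, and the helper names `summarize` and `standardized_distance` are assumptions of this sketch rather than part of Vacha-Haase's (1998) procedure. The sketch summarizes prior coefficients and then locates a new study's coefficient relative to that benchmark.

```python
from statistics import mean, median, stdev

# Hypothetical Cronbach's alphas for one subscale, one per prior study.
prior_alphas = [0.91, 0.84, 0.92, 0.88, 0.79, 0.86, 0.90]

def summarize(coefficients):
    """Central tendency and variability of reliability coefficients."""
    return {
        "k": len(coefficients),          # number of coefficients
        "mean": mean(coefficients),
        "median": median(coefficients),
        "sd": stdev(coefficients),
        "min": min(coefficients),
        "max": max(coefficients),
    }

def standardized_distance(new_alpha, summary):
    """How many SDs a new coefficient lies from the prior mean."""
    return (new_alpha - summary["mean"]) / summary["sd"]

summary = summarize(prior_alphas)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")

# A researcher obtaining alpha = .75 can see how atypical that value is.
print(f"distance of .75 from benchmark: "
      f"{standardized_distance(0.75, summary):.2f} SD")
```

This sketch treats every coefficient equally; whether to weight or transform coefficients before summarizing them is an analytic choice that it does not address.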
Examples of Reliability Estimates 

The following excerpts from prior studies of the NEO-PI-R 
(Costa & McCrae, 1992), a measure of the five-factor 
model of personality, demonstrate the language that is 
typically used to describe reliability data. These excerpts 
also demonstrate some decisions that the researcher 
may need to make in collecting the reliability 
generalization data. In addition, examples are given of 
possible independent variables that a researcher might 
choose to include in a reliability generalization analysis.

Some studies report reliability estimates from prior 
research rather than reliability estimates for the data in 
the current study. For example, McCrae (1987) wrote, 
"Internal consistency and 6-month retest reliability for 
the Neuroticism, Extraversion, and Openness scales range 
from .85 to .93" (p. 1260). In general, this practice can 
be identified by the reference to prior research. Studies 
that report reliability estimates for their own data 
will often precede those estimates with a 
phrase such as "In the present study."

Another way researchers assert score reliability 
without actually calculating estimates for 
their own data is to make general statements about 
reliability. For example, MacDonald, Anderson, Tsagarakis, 
& Holland (1994) wrote: "a fair amount of research has been 
done and excellent support for the validity and reliability 
of the domains has been consistently reported" (p. 341). 
Neither of these approaches provides any information about 
the reliability of the data in a particular study, so such 
studies cannot be included in the reliability 
generalization meta-analysis.

Another unusable form of reporting reliability 
estimates appears in Lay (1997), who reported 
that "Cronbach's alpha coefficients across the three 
separate samples ranged from 0.85 to 0.90" (p. 271). Because 
it is unclear which reliability estimate belongs with 
which variable, this study would be excluded from the 
reliability generalization analysis.

The following excerpts are examples of studies that do 
report usable reliability estimates, all in the form of 
Cronbach's alphas. Judge, Martocchio, & Thoresen (1997) 
reported: "Coefficient alphas for the personality scales were 
as follows: Neuroticism, α = .91; Extraversion, α = .87; 
Openness to Experience, α = .92; Agreeableness, α = .82; and 
Conscientiousness, α = .88" (p. 751). Cellar, Miller, 
Doverspike, & Klawsky (1996) reported that "The internal 
consistency reliabilities for the five factor scales 
calculated for our sample were .84 (n=359), .72 (n=359), 
.71 (n=359), .78 (n=359), and .85 (n=362) for Neuroticism, 
Extraversion, Openness to Experience, Agreeableness, and 
Conscientiousness, respectively" (p. 699). Note that the 
number of subjects is not the same for all five 
factors. The reliability generalization researcher could 
account for this by encoding a separate n for each 
subscale. Costa & McCrae (1995) reported, "In the present 
sample, internal consistencies for the five domains were 
.92, .89, .89, .87, and .91 for N, E, O, A, and C, 
respectively" (p. 312).
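
One way to encode the three usable excerpts above is one record per study and subscale, as in the following Python sketch. The alphas and the Cellar et al. sample sizes are taken from the quotations; the record layout, the variable names, and the use of `None` where an excerpt does not state n are assumptions of this illustration.

```python
from collections import defaultdict

SUBSCALES = ["N", "E", "O", "A", "C"]

# (study, alphas in N-E-O-A-C order, per-subscale n or None if not quoted)
excerpts = [
    ("Judge, Martocchio, & Thoresen (1997)",
     [0.91, 0.87, 0.92, 0.82, 0.88], [None] * 5),
    ("Cellar, Miller, Doverspike, & Klawsky (1996)",
     [0.84, 0.72, 0.71, 0.78, 0.85], [359, 359, 359, 359, 362]),
    ("Costa & McCrae (1995)",
     [0.92, 0.89, 0.89, 0.87, 0.91], [None] * 5),
]

# Long format: one record per study-by-subscale coefficient, so that each
# subscale can carry its own n.
records = [
    {"study": study, "subscale": subscale, "alpha": alpha, "n": n}
    for study, alphas, ns in excerpts
    for subscale, alpha, n in zip(SUBSCALES, alphas, ns)
]

# Group the coefficients by subscale and report the observed range.
by_subscale = defaultdict(list)
for record in records:
    by_subscale[record["subscale"]].append(record["alpha"])

alpha_range = {s: (min(v), max(v)) for s, v in by_subscale.items()}
for subscale in SUBSCALES:
    low, high = alpha_range[subscale]
    print(f"{subscale}: {low:.2f} to {high:.2f}")
```

Keeping n at the level of the subscale rather than the study accommodates reports such as Cellar et al. (1996), where Conscientiousness was based on 362 subjects and the other four scales on 359.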

Some examples from the prior studies that could have 
been encoded and included as independent variables in the 
reliability generalization are the number of subjects and 
the mean and standard deviation of the test scores. 
Participant characteristics that could have been included 
were age (mean, median, range, standard deviation); gender; 
marital status; race/ethnicity; education level; number of 
children; and retirement status. Of course, the nature of 
the test and the type of study using the test will largely 
determine the independent variables that are available for 
inclusion in the analysis. Once the data are encoded, 
statistical analyses (e.g., regression) can be conducted to 
determine the influence of the encoded independent 
variables on test score reliability.
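
The regression step might look like the following sketch, which fits an ordinary least squares line with the reliability coefficient as the dependent variable and one coded study feature as the predictor. The data are hypothetical values invented for illustration, the function name `simple_ols` is an assumption, and a single-predictor unweighted model is used only to keep the example short; an actual analysis would typically involve several coded variables.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x; return slope, intercept, R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2
                   for xi, yi in zip(x, y))
    r_squared = 1 - ss_resid / ss_total
    return slope, intercept, r_squared

# Hypothetical coded data: one entry per study.
score_sd = [4.0, 5.5, 6.0, 7.5, 8.0, 9.0]        # SD of test scores
alphas = [0.72, 0.78, 0.81, 0.86, 0.88, 0.90]    # reported Cronbach's alpha

slope, intercept, r_squared = simple_ols(score_sd, alphas)
print(f"slope = {slope:.4f}, intercept = {intercept:.3f}, "
      f"R squared = {r_squared:.3f}")
```

In these invented data the slope is positive, meaning that studies with more variable scores reported higher alphas; R squared indicates how much of the variation in coefficients across studies the coded feature accounts for.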

Summary 

This paper has discussed the value of learning more 
about the psychometric properties of test scores through 
meta-analysis of score reliability across multiple studies. 
The examples of usable and unusable reports of reliability 
estimates, as well as the language used to identify them, 
should provide a guide for researchers conducting 
reliability generalization studies. The results of a 
reliability generalization study will provide researchers 
with a better understanding of the reliability of scores 
obtained for tests in their particular studies as well as 
test characteristics that contribute most to score 
reliability in future studies. 